# loading fundamental packages
library(ggplot2)
library(dplyr)
library(readr)

Introduction & Motivation

A single logo or color on an infographic or data visualization instantly draws our attention to a specific data point. Although not a necessity, color and aesthetics are a bonus when visualizing data and conveying the message of your analysis. For instance, let’s say you wanted to write a report analyzing the outcomes of the US presidential elections. Graphs with clear red and blue labels depicting the two parties are more appealing to look at than black and white graphs with no distinct characteristics.

Graphs can certainly be generated with base R functions such as plot(), hist(), and boxplot(), but it takes extra manual work to make them tell a visually compelling story. The R package ggplot2 is more convenient than base plotting because it generates legends automatically, handles color scales, and supports faceting. As an advocate of graphically appealing data visualizations that convey a meaningful message, I find ggplot2 a great package for data analysts.
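To make the comparison concrete, here is a minimal sketch that draws the same scatterplot both ways. It assumes the built-in mtcars data set, which is not used elsewhere in this post, and the object name `p` is only for illustration.

```r
library(ggplot2)

# Base R: the colors and the legend have to be assembled by hand
cyl <- factor(mtcars$cyl)
plot(mtcars$wt, mtcars$mpg, col = cyl, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = levels(cyl),
       col = seq_along(levels(cyl)), pch = 19, title = "Cylinders")

# ggplot2: mapping cyl to color creates the legend automatically,
# and facet_wrap() splits the plot into one panel per number of gears
p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  facet_wrap(~ gear) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")
p
```

In the base version, the legend only matches the points because both use the same factor level order. In the ggplot2 version, the legend is derived from the color mapping itself.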

In class, we have mostly used ggplot2 to generate scatterplots (with loess lines), bar graphs, and line graphs. This post serves as an extension of what we learned in class. We will explore new ‘geom’ functions and learn how to generate graphs we have not covered, such as heat maps, using interesting data sets. Ultimately, we will look at ggplot2 graphs that are useful in real-life applications.

The following image is an example of a graph generated using the ggplot2 package:

# an example of a bar graph made using ggplot2 package
knitr::include_graphics("https://plot.ly/~RPlotBot/4026/migration-to-the-united-states-by-source-region-1820-2006-in-millions.png")

Background

The ggplot2 package was created by Hadley Wickham in 2005 and is an implementation of Leland Wilkinson’s Grammar of Graphics. According to Wickham, ggplot2 “provides beautiful, hassle-free plots, that take care of fiddly details like drawing legends” and “is designed to work in a layered fashion, starting with a layer showing the raw data then adding layers of annotations and statistical summaries” [1]. Since 2005, ggplot2 has grown to be one of the most popular packages in R, and numerous changes have been made to improve it over the years. On March 2nd, 2012, “ggplot2 version 0.9.0 was released with numerous changes to internal organization, scale construction and layers” [2]. The ggplot2 package is currently in the maintenance stage, so its existing features will continue to be enhanced, but new ones will not necessarily be developed.
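The layered design Wickham describes can be sketched in a few lines. This is only an illustration: it assumes the mpg data set that ships with ggplot2, and the annotation text and its position are made up for the example.

```r
library(ggplot2)

# Layer 1: the raw data
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.5)

# Layer 2: a statistical summary (a linear fit with its confidence band)
p <- p + geom_smooth(method = "lm", formula = y ~ x)

# Layer 3: an annotation placed at hand-picked coordinates
p <- p + annotate("text", x = 6, y = 40, label = "larger engines, lower mileage")
p
```

Each `+` adds one layer on top of the previous ones, so a plot can be built up and inspected step by step.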

Examples

1. Heatmap Plotting

Heat maps are graphical representations of data in which the individual values of a matrix are represented as colors. Heat maps are useful in real-life applications because they provide an immediate visual summary of the data. The forest fires data set focuses on forest fires that occurred in Montesinho Park in Portugal. The following code walks through how to generate a heat map using ggplot2. To plot this graph, I received guidance from [4].

The following image is an example of a heat map:

We will use the Forest Fires data set from UCI’s Machine Learning repository.

The data set has the following columns:

1. X: x-axis spatial coordinate within the Montesinho park map: 1 to 9
2. Y: y-axis spatial coordinate within the Montesinho park map: 2 to 9
3. month: month of the year: ‘jan’ to ‘dec’
4. day: day of the week: ‘mon’ to ‘sun’
5. FFMC: Fine Fuel Moisture Code index from the Forest Fire Weather Index (FWI) system: 18.7 to 96.20
6. DMC: Duff Moisture Code index from the FWI system: 1.1 to 291.3
7. DC: Drought Code index from the FWI system: 7.9 to 860.6
8. ISI: Initial Spread Index from the FWI system: 0.0 to 56.10
9. temp: temperature in Celsius degrees: 2.2 to 33.30
10. RH: relative humidity in %: 15.0 to 100
11. wind: wind speed in km/h: 0.40 to 9.40
12. rain: outside rain in mm/m2 : 0.0 to 6.4
13. area: the burned area of the forest (in ha): 0.00 to 1090.84

Once we load our data set, we remove irrelevant columns such as rain, area, X, and Y. The values for ‘rain’ and ‘area’ are mostly zero and do not describe the conditions of the fire, so we take them out of the data set. The spatial coordinates, X and Y, do not describe the fire either, so we omit those as well.

# loading the forest fires data set and reading csv file
forestfires <- read_csv("data/forestfires.csv")
## Parsed with column specification:
## cols(
##   X = col_integer(),
##   Y = col_integer(),
##   month = col_character(),
##   day = col_character(),
##   FFMC = col_double(),
##   DMC = col_double(),
##   DC = col_double(),
##   ISI = col_double(),
##   temp = col_double(),
##   RH = col_integer(),
##   wind = col_double(),
##   rain = col_double(),
##   area = col_double()
## )
# take out irrelevant columns
forestfires <- forestfires %>% select(-rain, -area, -X, -Y)

The months are ordered by temperature in Celsius, and the month variable is converted to a factor so that the plot sorts properly. The forest fire statistics have very different ranges, so we rescale the individual statistics (here with a square root) to bring them closer together.

# loading the package needed for reshaping
# (plyr is called as plyr::ddply() below so that it does not mask dplyr verbs)
library(reshape2)

# converting month to a factor ordered by temperature
forestfires$month <- with(forestfires, reorder(month, temp))

# reshaping to long format: one row per fire and statistic
forestfires.m <- melt(forestfires, id.vars = c("month", "day"))

# individual statistics are rescaled so that they are displayed nicely on heatmap
forestfires.s <- plyr::ddply(forestfires.m, "variable", transform, rescale = sqrt(value))

To create the heatmap, we combine geom_tile() with a smooth gradient fill. Note that the legend shows rescaled values rather than the raw values in the data set: because the statistics vary widely in size, the fill is the square root of each value.

# create heatmap using ggplot2
ggplot(forestfires.s, aes(variable, month)) +
  geom_tile(aes(fill = rescale), color = "white") +
  ggtitle("Forest Fire Characteristics in Montesinho Park") +
  # sequential two-color gradient from black (low) to red (high)
  scale_fill_gradient(low = "black", high = "red")
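The heat map above has two limitations. The data set contains many fires per month, so several tiles are drawn on top of each other in every cell. A square root also does not put the statistics on a common range. A minimal sketch of one possible alternative averages each statistic per month and then min-max rescales it to [0, 1]. The `fires` data frame below is a made-up stand-in, not real observations, and the names `fires_scaled`, `rescaled`, and `heat` are only for illustration.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Made-up stand-in for the forest fires data (NOT real observations)
fires <- data.frame(
  month = c("aug", "aug", "sep", "sep", "mar", "mar"),
  temp  = c(24.1, 21.3, 19.8, 22.0, 10.2, 12.4),
  RH    = c(35, 42, 40, 38, 55, 60),
  wind  = c(3.1, 4.5, 2.2, 4.0, 5.4, 4.9)
)

# dplyr:: prefixes guard against plyr masking summarise() and mutate()
fires_scaled <- fires %>%
  group_by(month) %>%
  dplyr::summarise(across(c(temp, RH, wind), mean)) %>%  # one row per month
  pivot_longer(-month, names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  # min-max rescaling within each statistic, so every column spans 0 to 1
  dplyr::mutate(rescaled = (value - min(value)) / (max(value) - min(value))) %>%
  ungroup()

heat <- ggplot(fires_scaled, aes(variable, month)) +
  geom_tile(aes(fill = rescaled), color = "white") +
  scale_fill_gradient(low = "white", high = "red")
heat
```

With this rescaling, the color of a tile shows how a month compares with the other months for that statistic. Colors are therefore comparable within a column but not across columns.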

2. Exploring geom_jitter()

Jittering is particularly useful for small datasets with at least one discrete position. geom_jitter() “adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets” [3].

We use the same forest fires data set in this example.

Since our forest fire data set isn’t large, geom_jitter() gives a clear picture of the temperatures recorded for the fires in each month of the year.

# creating a jitter geom with the same forest fire data set
ggplot(forestfires, aes(month, temp)) + ggtitle("Forest Fire Temperature (by Month)") +
  geom_jitter()
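To see what jittering changes, the sketch below draws the same data with geom_point() and with geom_jitter(). It assumes the mpg data set that ships with ggplot2 rather than the forest fires data, and the values chosen for `width`, `height`, and `alpha` are arbitrary.

```r
library(ggplot2)

set.seed(42)  # jitter is random; a seed makes the plot reproducible

# cty (city mileage) is recorded in whole numbers, so many points overlap exactly
overplotted <- ggplot(mpg, aes(class, cty)) +
  geom_point()

# width and height set the maximum shift in each direction (in data units);
# height = 0 keeps the measured y values exact and only spreads points sideways
jittered <- ggplot(mpg, aes(class, cty)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.6)

overplotted
jittered
```

Restricting the jitter to the discrete axis is a common choice, because the noise then never changes the values being compared.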

3. Lollipop Charts

head(midwest)
## # A tibble: 6 x 28
##     PID    county state  area poptotal popdensity popwhite popblack
##   <int>     <chr> <chr> <dbl>    <int>      <dbl>    <int>    <int>
## 1   561     ADAMS    IL 0.052    66090  1270.9615    63917     1702
## 2   562 ALEXANDER    IL 0.014    10626   759.0000     7054     3496
## 3   563      BOND    IL 0.022    14991   681.4091    14477      429
## 4   564     BOONE    IL 0.017    30806  1812.1176    29344      127
## 5   565     BROWN    IL 0.018     5836   324.2222     5264      547
## 6   566    BUREAU    IL 0.050    35688   713.7600    35157       50
## # ... with 20 more variables: popamerindian <int>, popasian <int>,
## #   popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## #   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## #   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>
# top 10 Illinois counties by percent college educated, ordered for plotting
illinois_top10 <- midwest %>%
        filter(state == "IL") %>%
        select(county, percollege) %>%
        arrange(desc(percollege)) %>%
        top_n(10) %>%
        arrange(percollege) %>%
        mutate(county = factor(county, levels = .$county))
## Selecting by percollege
# lollipop chart: a segment from 0 to the value, capped with a point
ggplot(illinois_top10, aes(percollege, county)) +
        geom_segment(aes(x = 0, y = county, xend = percollege, yend = county), color = "grey50") +
        geom_point()

# all Illinois counties, flagged by whether they are above the mean across counties
illinois <- midwest %>%
        filter(state == "IL") %>%
        select(county, percollege) %>%
        arrange(percollege) %>%
        mutate(Avg = mean(percollege, na.rm = TRUE),
               Above = percollege > Avg,
               county = factor(county, levels = .$county))

# lollipops drawn from the average instead of from 0
ggplot(illinois, aes(percollege, county, color = Above)) +
        geom_segment(aes(x = Avg, y = county, xend = percollege, yend = county), color = "grey50") +
        geom_point()

4. Spatial Visualization

The ggmap package is another very useful package for real-life applications. It interacts with the Google Maps API to get the coordinates (latitude and longitude) of the places you want to plot. The example below requests satellite, road, and hybrid maps of Seoul and encircles some famous tourist locations on the road and hybrid maps. The geocode() function is used to get the coordinates of the places, and qmap() is used to get the maps.

# loading packages for spatial plotting
library(ggmap)
library(ggalt)

# get Seoul's longitude and latitude coordinates
# (recent versions of ggmap need a Google API key set with register_google() before this call)
seoul <- geocode("Seoul")

# get the map
# google satellite map
seoul_sat_map <- qmap("seoul", zoom = 13, source = "google", maptype = "satellite")

# google road map
seoul_road_map <- qmap("seoul", zoom = 13, source = "google", maptype = "roadmap")

# google hybrid map
seoul_hybrid_map <- qmap("seoul", zoom = 13, source = "google", maptype = "hybrid")

# get coordinates for places in seoul
seoul_places <- c("Namsan",
                  "Gyeongbokgung",
                  "N Seoul Tower",
                  "Myeong-dong")

# get longitudes and latitudes of places in seoul
places_loc <- geocode(seoul_places)

# google road map of seoul
seoul_road_map + geom_point(aes(x = lon, y = lat),
                                  data = places_loc, 
                                  alpha = 0.7, 
                                  size = 7, 
                                  color = "red") + 
                       geom_encircle(aes(x = lon, y = lat),
                                     data = places_loc, size = 2, color = "blue")

# google hybrid map of seoul
seoul_hybrid_map + geom_point(aes(x = lon, y = lat),
                                     data = places_loc, 
                                     alpha = 0.7, 
                                     size = 7, 
                                     color = "red") + 
                          geom_encircle(aes(x = lon, y = lat),
                                        data = places_loc, size = 2, color = "blue")

Discussion

Conclusion

References

[1] http://moderngraphics11.pbworks.com/f/ggplot2-Book09hWickham.pdf

[2] https://en.wikipedia.org/wiki/Ggplot2

[3] http://ggplot2.tidyverse.org/reference/geom_jitter.html

[4] https://www.r-bloggers.com/how-to-make-a-simple-heatmap-in-ggplot2/

[5] http://r4stats.com/examples/graphics-ggplot2/

[6] http://archive.ics.uci.edu/ml/datasets/Forest+Fires